1 Climate change and temperature anomalies

If we wanted to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here

To define temperature anomalies you need to have a reference, or base, period which NASA clearly states that it is the period between 1951-1980.

Run the code below to load the file:

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")

Notice that, when using this function, we added two options: skip and na.

  1. The skip=1 option is there as the real data table only starts in Row 2, so we need to skip one row.
  2. na = "***" option informs R how missing observations in the spreadsheet are coded. When looking at the spreadsheet, you can see that missing data is coded as “***”. It is best to specify this here, as otherwise some of the data is not recognized as numeric data.

Once the data is loaded, notice that there is a object titled weather in the Environment panel. If you cannot see the panel (usually on the top-right), go to Tools > Global Options > Pane Layout and tick the checkbox next to Environment. Click on the weather object, and the dataframe will pop up on a seperate tab. Inspect the dataframe.

For each month and year, the dataframe shows the deviation of temperature from the normal (expected). Further the dataframe is in wide format.

You have two objectives in this section:

  1. Select the year and the twelve month variables from the weather dataset. We do not need the others (J-D, D-N, DJF, etc.) for this assignment. Hint: use select() function.

  2. Convert the dataframe from wide to ‘long’ format. Hint: use gather() or pivot_longer() function. Name the new dataframe as tidyweather, name the variable containing the name of the month as month, and the temperature deviation values as delta.

# Create tidyweather data
tidyweather <- weather %>% 
  select(-c("J-D", "D-N", "DJF", "JJA", "MAM", "SON")) %>% 
  pivot_longer(cols = -Year, names_to = "month", values_to = "delta")

Inspect your dataframe. It should have three variables now, one each for

  1. year,
  2. month, and
  3. delta, or temperature deviation.

1.1 Plotting Information

Let us plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called date in order to ensure that the delta values are plot chronologically.

In the following chunk of code, I used the eval=FALSE argument, which does not run a chunk of code; I did so that you can knit the document before tidying the data and creating a new dataframe tidyweather. When you actually want to run this code and knit your document, you must delete eval=FALSE, not just here but in all chunks were eval=FALSE appears.

# Set date format tidyweather data
tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), month, "1")),
         month = month(date, label=TRUE),
         year = year(date))

# Plot weather anomalies
ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point()+
  geom_smooth(color="red") +
  theme_bw() +
  labs (
    title = "Weather Anomalies"
  )

Is the effect of increasing temperature more pronounced in some months? Use facet_wrap() to produce a seperate scatter plot for each month, again with a smoothing line. Your chart should human-readable labels; that is, each month should be labeled “Jan”, “Feb”, “Mar” (full or abbreviated month names are fine), not 1, 2, 3.

Answer:

The effect of increasing temperature is more pronounced in the winter months - starting from the end of fall until the start of spring. We see this from the higher values within the winter months over the other months across time. This reflects the effects of global warming/climate change throughout the last decade.

It is sometimes useful to group data into different time periods to study historical data. For example, we often refer to decades such as 1970s, 1980s, 1990s etc. to refer to a period of time. NASA calcuialtes a temperature anomaly, as difference form the base periof of 1951-1980. The code below creates a new data frame called comparison that groups data in five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.

We remove data before 1800 and before using filter. Then, we use the mutate function to create a new variable interval which contains information on which period each observation belongs to. We can assign the different periods using case_when().

# Assign periods
comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

Inspect the comparison dataframe by clicking on it in the Environment pane.

Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in. Set fill to interval to group and colour the data by different time periods.

# Create density plot of monthly deviation based off each interval period
ggplot(comparison, aes(delta, fill = interval, alpha = 0.1)) +
  geom_density() +
  labs(title = "Denisty plots of monthly deviations", subtitle = "Separated by Intervals", x = "Delta", y = "Density") +
  theme(plot.title = element_text(size = 14, hjust = 0.5),
        plot.subtitle = element_text(size = 12, hjust = 0.5),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10)
)

NOTE:

From 1881 until 1980, we did not observe huge anomalies in terms of temperature delta from the base period (1951-1980). However, this was not reflected past 1980; temperatures began rising more drastically from 1981-present. More strikingly, the average delta for the last period is around 0.5 higher than that of the previous period (1981-2010). This hints at even higher average deviations in future that come faster unless a downward pressure is exerted on temperature increases through climate change measures.

So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using group_by() and summarise(), followed by a scatter plot to display the result.

# Create yearly averages
average_annual_anomaly <- comparison %>%
  group_by(Year) %>% 
  summarise(mean_delta_year = mean(delta))

# plot mean annual delta across years:
ggplot(average_annual_anomaly, aes(x = Year, y= mean_delta_year)) +
  geom_point() +
  geom_smooth(colour = "red", method = "loess") +
  theme_bw() +
  labs (
    title = "Yearly Weather Anomalies",
    y = "Mean annual delta"
  )

#Fit the best fit line, using LOESS method
#change theme to theme_bw() to have white background + black frame around plot

NOTE:

Like what we were hinting at earlier, the yearly anomalies over the past 4 decades (starting from 1980) have followed a relatively exponential upward trend. During the 1980s, the average yearly delta was between 0 and 0.5 whereas this average has climbed past 1.5 in the most recent years. This increase of 1 degree overall has not been witnessed in the century prior to 1980, thus emphasising the magnitude of the rapidly worsening global warming phenomenon.

1.2 Confidence Interval for delta

NASA points out on their website that

A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.

Your task is to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.

# Formula method of CI calculation
formula_ci <- comparison %>%
  filter(interval == "2011-present") %>%
  drop_na() %>% 
  summarise(mean_delta = mean(delta),
                     sd_delta = sd(delta),
                     count = n(),
                     t_critical = qt(0.975, count-1),
                     se_delta = sd_delta / sqrt(count),
                     margin_of_error = t_critical * se_delta,
                     delta_low  = mean_delta - margin_of_error,
                     delta_high  = mean_delta + margin_of_error
            )
# Print out formula_CI
formula_ci
## # A tibble: 1 × 8
##   mean_delta sd_delta count t_critical se_delta margin_of_error delta_…¹ delta…²
##        <dbl>    <dbl> <int>      <dbl>    <dbl>           <dbl>    <dbl>   <dbl>
## 1       1.07    0.266   139       1.98   0.0226          0.0446     1.02    1.11
## # … with abbreviated variable names ¹​delta_low, ²​delta_high
library(infer)
# Bootstrap method of CI calculation using Infer package
infer_bootstrap_ci <- comparison %>%
  filter(interval == "2011-present") %>%
  specify(response = delta) %>%
  generate(reps = 1000,  type = "bootstrap") %>% 
  calculate(stat = "mean", na.rm = TRUE) %>% 
  get_confidence_interval(level = .95)
# Print out bootstrap CI using Infer package
infer_bootstrap_ci
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.02     1.11

What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, and the interpretation of the result. One paragraph max, please!

Answer:

The results give an interesting insight into the sea and land temperature differences during the 2011 - present time interval. The mean of the temperature delta is around 1.06 degrees. By means of a confidence interval calculation we can state with 95% certainty that the mean temperature delta is between 1.02 and 1.11 degrees for this time period. As there is no underlying distribution, we used the t-statistics and a degrees of freedom of n-1 to calculate the critical values required for a confidence interval calculation. We confirmed this estimate by using a bootstrapping method that manually created the confidence intervals and provided a lower bound of 1.02 degrees and a upper bound of 1.11 degrees.

2 Biden’s Approval Margins

As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president’s approval

# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv') 

glimpse(approval_polllist)
## Rows: 4,495
## Columns: 22
## $ president           <chr> "Joe Biden", "Joe Biden", "Joe Biden", "Joe Biden"…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <chr> "9/9/2022", "9/9/2022", "9/9/2022", "9/9/2022", "9…
## $ startdate           <chr> "1/19/2021", "1/19/2021", "1/20/2021", "1/20/2021"…
## $ enddate             <chr> "1/21/2021", "1/21/2021", "1/21/2021", "1/21/2021"…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Morni…
## $ grade               <chr> "B", "B", "B", "B+", "B", "B-", "B+", "B-", "B", "…
## $ samplesize          <dbl> 1500, 15000, 1993, 1516, 15000, 1115, 941, 1200, 1…
## $ population          <chr> "lv", "a", "rv", "a", "a", "a", "rv", "rv", "a", "…
## $ weight              <dbl> 0.3382, 0.2594, 0.0930, 1.2454, 0.2333, 1.1014, 1.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48.0, 50.0, 56.0, 45.0, 51.0, 55.5, 63.0, 58.0, 52…
## $ disapprove          <dbl> 45.0, 28.0, 31.0, 28.0, 28.0, 31.6, 37.0, 32.0, 29…
## $ adjusted_approve    <dbl> 49.2, 49.4, 55.4, 46.0, 50.4, 54.6, 59.5, 57.5, 51…
## $ adjusted_disapprove <dbl> 40.3, 30.9, 33.9, 29.0, 30.9, 32.5, 38.4, 32.8, 31…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking            <lgl> TRUE, TRUE, NA, NA, TRUE, NA, NA, NA, TRUE, TRUE, …
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74247, 74272, 74246, 74327, 74273, 74248, 74256, 7…
## $ question_id         <dbl> 139395, 139491, 139394, 139570, 139492, 139404, 13…
## $ createddate         <chr> "1/22/2021", "1/28/2021", "1/22/2021", "2/2/2021",…
## $ timestamp           <chr> "09:48:31  9 Sep 2022", "09:48:31  9 Sep 2022", "0…
# Use `lubridate` to fix dates, as they are given as characters.

2.1 Create a plot

What I would like you to do is to calculate the average net approval rate (approve- disapprove) for each week since he got into office. I want you plot the net approval for each week in 2022, along with its 95% confidence interval. There are various dates given for each poll, please use enddate, i.e., the date the poll ended. Your plot should look something like this:

# Create summary statistics of polls grouped by subgroups and across weeks in 2022
poll_summary <- approval_polllist %>% 
  mutate(enddate = mdy(enddate)) %>% 
  mutate(net_approval = adjusted_approve - adjusted_disapprove) %>%
  mutate(year = lubridate::year(enddate), #Split date format into years and week
         week = lubridate::week(enddate)) %>%
  filter(year == 2022) %>%
  select(c(subgroup, adjusted_approve, adjusted_disapprove, net_approval, year, week)) %>% #Select variables of interest 
  group_by(week, subgroup) %>%
  summarise(mean_approval = mean(net_approval), #Create summary statisticcs and Confidence Intervals for each week and subgroup combination
            sd_approval = sd(net_approval),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_approval = sd_approval / sqrt(count),
            margin_of_error = t_critical * se_approval,
            approval_low  = mean_approval - margin_of_error,
            approval_high  = mean_approval + margin_of_error)

# Plot overall confidence interval for net approval across weeks in 2022
# Facet across subgroup type
ggplot(poll_summary, aes(week, group = subgroup, colour = subgroup)) +
  geom_line(aes(y = mean_approval), show.legend = FALSE) +
  geom_line(aes(y = approval_low), show.legend = FALSE) +
  geom_line(aes(y = approval_high), show.legend = FALSE) +
  geom_ribbon(aes(ymin = approval_low, ymax = approval_high, group = subgroup, fill = "red", alpha = 0.3), show.legend = FALSE) + #Create ribbon between upper and lower bounds
  facet_grid(rows = vars(subgroup)) +
  labs(title = "Biden's Net Approval Rating in 2022", subtitle = "Weekly Data, Approve - Disapprove, %", x = "Week in 2022", y = "%") +
  theme_light() +
  theme(plot.title = element_text(size = 14, hjust = 0.5),
        plot.subtitle = element_text(size = 12, hjust = 0.5),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        strip.text.x = element_text(size = 10))

NOTE:

Among all voters, regardless of subgroups, Biden’s net approval ratings have consistently been negative and falling. His net approval spiked past week 29 of 2022 possibly due to a sentiment vote for Joe Biden testing positive for coronavirus on the 21st of July 2022. On a more serious note, more probable reasons could be the passing of the Inflation Reduction Act, and the approval of a search warrant of former president Donald Trump’s house in Florida for classified documents; these both happened at the beginning of August 2022.

3 Challenge 1: Excess rentals in TfL bike sharing

Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2022-09-06T12%3A41%3A48/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20220911%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20220911T070936Z&X-Amz-Expires=300&X-Amz-Signature=d65c619d35d3fdeb465019e5ee6dc7ef4529195266da6d1099db04b6220152b7&X-Amz-SignedHeaders=host]
##   Date: 2022-09-11 07:09
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 180 kB
## <ON DISK>  C:\Users\ASUS\AppData\Local\Temp\RtmpaQ3mtm\file105d81edb7879.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and weeks
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = year(day),
          month = lubridate::month(day, label = TRUE),
          week = isoweek(day))

We can easily create a facet grid that plots bikes hired by month and year since 2015

However, the challenge I want you to work on is to reproduce the following two graphs.

The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).

For both of these graphs, you have to calculate the expected number of rentals per week or month between 2016-2019 and then, see how each week/month of 2020-2022 compares to the expected rentals. Think of the calculation excess_rentals = actual_rentals - expected_rentals.

Should you use the mean or the median to calculate your expected rentals? Why?

Answer:

We should use the mean to calculate expected rentals. In our case, we wanted to smooth (by averaging) all of the data, including anomalies, across the years for rentals within each month/week; this gives a clearer picture of an “average” for each time period since the median statistic would automatically exclude the effects of outliers.

In creating your plots, you may find these links useful:

# Create expected monthly bike rental by averaging monthly data from 2016-2019
expected_monthly_rental_summary <- bike %>% 
  filter(year >= 2016 & year <= 2019) %>% 
  group_by(month) %>% 
  summarise(expected_rental_mean = mean(bikes_hired, na.rm = TRUE))

# Create actual monthly bike rental data from 2017-2022
monthly_summary_17_22 <- bike %>% 
  filter(year >= 2017) %>% 
  group_by(year, month) %>% 
  summarise(monthly_mean = mean(bikes_hired))

# Join both tables
monthly_full_summary <- monthly_summary_17_22 %>% 
  left_join(y = expected_monthly_rental_summary, by = "month")

# Plot monthly changes in bike rentals faceted on year
ggplot(monthly_full_summary) +
  geom_line(aes(x = month, y = monthly_mean, group = year), colour = "black", show.legend = FALSE) + #Line created for actual bike rental
  geom_line(aes(x = month, y = expected_rental_mean, group = year), colour = "blue", size = 1.2, show.legend = FALSE) + #Line created for expected bike rental
  facet_wrap(~ year) +
  geom_ribbon(aes(x = month, ymin = monthly_mean, ymax = pmax(monthly_mean, expected_rental_mean), group = year, fill = "red"), alpha = 0.3, show.legend = FALSE) + #Ribbon created for expected>actual
  geom_ribbon(aes(x = month, ymin = expected_rental_mean, ymax = pmax(monthly_mean, expected_rental_mean), group = year, fill = "green"), alpha = 0.3, show.legend = FALSE)  + #Ribbon created using actual>expected
  scale_fill_manual(values=c("green", "red"), name="fill") + #change colour fill of ribbons
  labs(title = "Monthly Changes in TfL bike rentals", subtitle = "Change from monthly average calculated between 2016 annd 2019", x = "Month", y = "Bike Rentals") +
  theme_light() +
  theme(plot.title = element_text(size = 14, hjust = 0.5),
        plot.subtitle = element_text(size = 12, hjust = 0.5),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.text.x = element_text(angle = 90),
        strip.text.x = element_text(size = 10)) 

# Create expected weekly bike rental by averaging weekly data from 2016-2019
expected_weekly_rental_summary <- bike %>% 
  filter(year >= 2016 & year <= 2019) %>% 
  group_by(week) %>% 
  summarise(expected_rentals = mean(bikes_hired, na.rm = TRUE))

# Create actual weekly bike rental data from 2017-2022
weekly_summary_17_22 <- bike %>% 
  filter(year >= 2017) %>% 
  group_by(year, week) %>% 
  summarise(weekly_mean = mean(bikes_hired))

# Join both tables and create new column measuring weekly percent deviation
weekly_full_summary <- weekly_summary_17_22 %>% 
  left_join(y = expected_weekly_rental_summary, by = "week") %>%
  mutate(percent_change_weekly = (weekly_mean/expected_rentals - 1) * 100) %>%
  filter(!(week >= 52 & year == 2022)) %>% #filter date anomalies (due to the way R defines week 52 and 53)
  mutate(up = ifelse(percent_change_weekly > 0, 1, "")) %>%
  mutate(down = ifelse(percent_change_weekly < 0, 1, "")) #Create up and down columns for rugs later
  
# Plot weekly percent deviation in bike rentals faceted on year
ggplot(weekly_full_summary, aes(x = week)) +
  facet_wrap(~ year) +
  geom_rect(aes(xmin = 14, xmax = 26, ymin = -50, ymax = 100), fill = "grey", alpha = 0.1) + #Create rectangle representing Q2
  geom_rect(aes(xmin = 40, xmax = 52, ymin = -50, ymax = 100), fill = "grey", alpha = 0.1) + #Create rectangle representing Q4
  geom_line(aes(y = percent_change_weekly), show.legend = FALSE) + 
  geom_ribbon(aes(ymin = 0, ymax = pmin(0,percent_change_weekly)), alpha = 0.3, fill = "red", show.legend = FALSE) +
  geom_ribbon(aes(ymin = percent_change_weekly, ymax = pmin(0,percent_change_weekly)), fill = "green", alpha = 0.3, show.legend = FALSE) + #Create ribbons
  geom_rug(data = subset(weekly_full_summary, up == 1), color = "green", sides = "b", show.legend = FALSE) + #Create green rugs
  geom_rug(data = subset(weekly_full_summary, down == 1), color = "red", sides = "b", show.legend = FALSE) + #Create red rugs
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(title = "Weekly Changes in TfL bike rentals", subtitle = "% change from weekly average calculated between 2016 annd 2019", x = "Week", y = "Bike Rentals") +
  theme_light() +
  theme(plot.title = element_text(size = 14, hjust = 0.5),
        plot.subtitle = element_text(size = 12, hjust = 0.5),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        strip.text.x = element_text(size = 10))

NOTE:

We see that during the beginning of coronavirus lockdown periods, we observed a notable reduction in bike rentals. Conversely, past the lifting of general coronavirus measures in mid 2021, we have seen consistent positive excess rentals compared to the mean 2016-2019 bike rentals; this suggests that the local population was particularly inclined to get out of the house and into the fresh air by cycling more regularly. A similar pattern can be observed for the weekly analysis of bike rentals; bike rentals by more than 50%, as compared to the period from 2016-2019, in most weeks after the lockdown measures were lifted.

4 Challenge 2: Share of renewable energy production in the world

The National Bureau of Economic Research (NBER) has a a very interesting dataset on the adoption of about 200 technologies in more than 150 countries since 1800. This is theCross-country Historical Adoption of Technology (CHAT) dataset.

The following is a description of the variables

variable class description
variable character Variable name
label character Label for variable
iso3c character Country code
year double Year
group character Group (consumption/production)
category character Category
value double Value (related to label)
technology <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-19/technology.csv')

#get all technologies
labels <- technology %>% 
  distinct(variable, label)

# Get country names using 'countrycode' package
technology <- technology %>% 
  filter(iso3c != "XCD") %>% 
  mutate(iso3c = recode(iso3c, "ROM" = "ROU"),
         country = countrycode(iso3c, origin = "iso3c", destination = "country.name"),
         country = case_when(
           iso3c == "ANT" ~ "Netherlands Antilles",
           iso3c == "CSK" ~ "Czechoslovakia",
           iso3c == "XKX" ~ "Kosovo",
           TRUE           ~ country))

#make smaller dataframe on energy
energy <- technology %>% 
  filter(category == "Energy")

# download CO2 per capita from World Bank using {wbstats} package
# https://data.worldbank.org/indicator/EN.ATM.CO2E.PC
co2_percap <- wb_data(country = "countries_only", 
                      indicator = "EN.ATM.CO2E.PC", 
                      start_date = 1970, 
                      end_date = 2022,
                      return_wide=FALSE) %>% 
  filter(!is.na(value)) %>% 
  #drop unwanted variables
  select(-c(unit, obs_status, footnote, last_updated))

# get a list of countries and their characteristics
# we just want to get the region a country is in and its income level
countries <-  wb_cachelist$countries %>% 
  select(iso3c,region,income_level)

This is a very rich data set, not just for energy and CO2 data, but for many other technologies. In our case, we just need to produce a couple of graphs– at this stage, the emphasis is on data manipulation, rather than making the graphs gorgeous.

First, produce a graph with the countries with the highest and lowest % contribution of renewables in energy production. This is made up of elec_hydro, elec_solar, elec_wind, and elec_renew_other. You may want to use the patchwork package to assemble the two charts next to each other.

NOTE: We decided to filter on year == 2019 only to replicate the graphs shown since the subtitle in the plots above says “2019 data”.

# Create country's proportion of total energy production that stems from renewable sources
# We used 2019 data only (see NOTE above code chunk)

renewables_summary <- energy %>%
  select(-label) %>%
  filter(year == 2019) %>%
  pivot_wider(names_from = variable, values_from = value) %>% #Transform to wide form dataset
  mutate(renewables_proportion = round((elec_hydro + elec_solar + elec_wind + elec_renew_other) / elecprod, 3) * 100) %>% #Create proportion variable.  Take note that we round to 3 here to perfectly replicate the graph shown above. Any other value will return a slightly different plot
  arrange(desc(renewables_proportion)) %>% 
  ungroup() %>%
  filter(renewables_proportion > 0)

# Extract top 20 countries by renewable energy proportion
top_20_renewables <- renewables_summary %>% 
  slice_max(n = 20, order_by = renewables_proportion)

# Extract bottom 20 countries by renewable energy proportion  
bottom_20_renewables <- renewables_summary %>% 
  slice_min(n = 20, order_by = renewables_proportion)

# Store a plot of top 20 countries reordered on proportion
top_20_plot <- ggplot(top_20_renewables, aes(x = renewables_proportion, y = fct_reorder(country, renewables_proportion))) +
  geom_col(orientation = "y")  +
  scale_x_continuous(labels = scales::percent_format(scale = 1)) +
  labs(title = str_wrap("Highest % of renewables in energy production", 30), x = "Percent") +
  theme(plot.title = element_text(size = 12, hjust = 0.5),
        axis.title.x = element_text(size = 10),
        axis.title.y = element_blank(),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 5))

# Store a plot of bottom 20 countries reordered on proportion
bottom_20_plot <- ggplot(bottom_20_renewables, aes(x = renewables_proportion, y = fct_reorder(country, renewables_proportion))) +
  geom_col(orientation = "y")  +
  scale_x_continuous(labels = scales::percent_format(scale = 1)) +
  labs(title = str_wrap("Lowest % of renewables in energy production", 30), x = "Percent") +
  theme(plot.title = element_text(size = 12, hjust = 0.5),
        axis.title.x = element_text(size = 10),
        axis.title.y = element_blank(),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 5))

# Use patchwork package to create comparison plot
top_20_plot + bottom_20_plot +
  plot_annotation(title = "Highest and lowest % of renewables in energy production",
                  subtitle = "2019 data") +
  theme(plot.title = element_text(size = 14),
        plot.subtitle = element_text(size = 12))

Second, you can produce an animation to explore the relationship between CO2 per capita emissions and the deployment of renewables. As the % of energy generated by renewables goes up, do CO2 per capita emissions seem to go down?

Answer:

From observing the animation, 2 features are striking.

The first would be that we observe an inverse relationship between C2 emissions and % renewables; that is, when a larger proportion of a country’s energy production comes from renewable sources, the lower the CO2 emissions for that country and vice-versa. This is represented by the points that move along a north-west to south-east line in each of the income group plots. This can mostly be seen in all facets except “Low Income”; these countries had CO2 emissions that stayed pretty much constant no matter how their % renewables changed.

The second pattern we notice is that for countries in the High Income group, there exist a few select countries in which % renewable energy production proportion remained relatively constant despite CO2 emissions fluctuating drastically (CO2 emissions changed perhaps due to compliance to climate change regulations); these are also the same countries with the highest CO2 emissions amongst all other countries. Likewise, many other countries changed the extent to which they relied on renewable energy in their energy output but had no change to CO2 emissions; these were mainly countries with relatively lower CO2 emissions to begin with. This pattern implies that, for these countries, there exists almost no correlation between CO2 per cap and % renewables across time.

To create this animation is actually straight-forward. You manipulate your date, and the create the graph in the normal ggplot way. the only gganimate layers you need to add to your graphs are

  labs(title = 'Year: {frame_time}', 
       x = '% renewables', 
       y = 'CO2 per cap') +
  transition_time(year) +
  ease_aes('linear')
# Rename columns in co2 dataset for ease
co2_percap <- co2_percap %>% 
  rename(year = date) %>% 
  rename(emissions = value)

# Create full dataset of energy, income and co2 emissions using join functions
energy_income_co2 <- energy %>% 
  left_join(., countries, by = "iso3c") %>% 
  left_join(., select(co2_percap, -c(iso2c, indicator, indicator_id)), by = c("iso3c", "year", "country")) %>%
  mutate(year = as.integer(year)) #Change year to integer so that the animation correctly reflects yearly change

# Create renewables proportion of countries across years
renewables_proportion_summary <- energy_income_co2 %>%
  select(-label) %>% 
  pivot_wider(names_from = variable, values_from = value) %>% #Transform to wide data
  select(-c(elec_coal, elec_cons, elec_gas, elec_nuc, elec_oil, electric_gen_capacity)) %>% #ALL GOOD TIL NOW
  group_by(country, year) %>% 
  summarise(renewables_proportion = (elec_hydro + elec_solar + elec_wind + elec_renew_other) / elecprod * 100,
            income_level = first(income_level), n = n(),
            emissions = mean(emissions)) %>%   #Create summary statistics that will be plotted
  filter(!(is.na(income_level))) %>%
  filter(year >= 1990)

# Plot faceted (income level) animation of emissions againsnt renewables proportion across years
ggplot(renewables_proportion_summary, aes(renewables_proportion, emissions, colour = factor(income_level))) +
  geom_point(show.legend = FALSE) + #Create scatterplot
  facet_wrap(~ income_level) + #Facet on incomem level
  scale_x_continuous(labels = scales::percent_format(scale = 1)) +
  labs(title = 'Year: {frame_time}',
       x = '% renewables',
       y = 'CO2 per cap') +
  transition_time(year) +
  ease_aes('linear') #Animation controls

5 Deliverables

As usual, there is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

6 Details

  • Who did you collaborate with: Alex Scheuer, Dhruvi Mundra, Heng Jian Shun, Marta Wnek, Sharon Wenyu Xu, Xueying Liu
  • Approximately how much time did you spend on this problem set: 20 hours
  • What, if anything, gave you the most trouble: It was better this time!

Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

Yes, we understand the code!

7 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.